Multi-Domain #105

Merged
rafapi merged 167 commits into main from multi-env on Feb 4, 2026

@rafapi rafapi commented Nov 17, 2025

Enables simultaneous training across multiple domains (math, coding, function calling) with domain-agnostic orchestration.

Architecture

| Component | Role |
|---|---|
| multidomain/loader.py | Parses :: syntax, concatenates datasets, injects domain field into each sample |
| dispatcher.py | Routes problem["domain"] → domain-specific rollout callable via actor.domain_rollouts mapping |
| domain_sampling.py | Weighted sampling with adaptive rebalancing based on completion ratios to maintain target mix despite varying rollout latencies |
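
To make the loader and dispatcher roles concrete, here is a minimal sketch of the flow described above. It is not the code from this PR: the function names `parse_dataset_spec`, `tag_samples`, and `dispatch` are hypothetical, and samples are assumed to be plain dicts.

```python
from typing import Any, Callable, Dict, List, Tuple


def parse_dataset_spec(spec: str) -> Tuple[str, str]:
    """Split a 'domain::dataset' spec into (domain, dataset_name)."""
    domain, sep, name = spec.partition("::")
    if not sep or not domain or not name:
        raise ValueError(f"expected 'domain::dataset', got {spec!r}")
    return domain, name


def tag_samples(spec: str, samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the samples with a 'domain' field injected."""
    domain, _ = parse_dataset_spec(spec)
    return [{**sample, "domain": domain} for sample in samples]


def dispatch(problem: Dict[str, Any],
             rollouts: Dict[str, Callable[[Dict[str, Any]], Any]]) -> Any:
    """Route a problem to the rollout callable registered for its domain."""
    domain = problem["domain"]
    if domain not in rollouts:
        raise KeyError(f"no rollout registered for domain {domain!r}")
    return rollouts[domain](problem)
```

Raising on an unknown domain is a design choice of this sketch; the PR may handle a missing mapping differently.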

Configuration

  actor:
    domain_mix:        # sampling weights (normalised at runtime)
      math: 0.4
      coding: 0.3
      fn_calling: 0.3
    domain_rollouts:   # domain → rollout function mapping
      math: pipelinerl.domains.math.generate_math_rollout
      coding: pipelinerl.domains.coding.generate_coding_rollout
    domain_system_prompts:  # per-domain system prompts
      coding: "You are an expert Python programmer..."

  train_dataset_names:
    - math::open_reasoner_zero_57k
    - coding::coding@train
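
As an illustration of how such a config could be consumed, the sketch below normalises `domain_mix` weights and resolves a dotted path like those under `domain_rollouts` into a callable. The helper names `normalise_mix` and `resolve_callable` are hypothetical, and the PR may resolve these paths through its config framework instead.

```python
import importlib
from typing import Any, Callable, Dict


def normalise_mix(mix: Dict[str, float]) -> Dict[str, float]:
    """Scale sampling weights so that they sum to 1."""
    total = sum(mix.values())
    if total <= 0:
        raise ValueError("domain_mix weights must sum to a positive value")
    return {domain: weight / total for domain, weight in mix.items()}


def resolve_callable(dotted_path: str) -> Callable[..., Any]:
    """Import 'package.module.attr' and return the attribute it names."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)
```

For example, `resolve_callable("math.sqrt")` returns the standard library's `math.sqrt`; a rollout path from the config would be resolved the same way, provided the module is importable.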

Adaptive Sampling

DomainWeightedSampler adjusts weights dynamically: adjusted_weight = base_weight × (target_ratio / actual_ratio), clamped to [0.1, 10.0]. This compensates for domains with slower rollouts (e.g. coding sandbox execution) to maintain the configured mix in the output stream.
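
The rebalancing rule can be sketched as follows. This is an illustration under stated assumptions, not the `DomainWeightedSampler` implementation: the clamp is applied to the correction factor `target_ratio / actual_ratio` (the description could also be read as clamping the adjusted weight itself), a domain with no completions yet gets the maximum factor, and the function names are hypothetical.

```python
import random
from typing import Dict

MIN_FACTOR, MAX_FACTOR = 0.1, 10.0  # clamp bounds from the PR description


def adjusted_weights(base_weights: Dict[str, float],
                     completed: Dict[str, int]) -> Dict[str, float]:
    """Scale each base weight by the clamped target/actual ratio."""
    total_weight = sum(base_weights.values())
    total_done = sum(completed.get(d, 0) for d in base_weights)
    adjusted = {}
    for domain, base in base_weights.items():
        target_ratio = base / total_weight
        if total_done == 0:
            factor = 1.0  # nothing completed yet: keep the base weights
        else:
            actual_ratio = completed.get(domain, 0) / total_done
            # Assumption: a domain with zero completions gets the max boost.
            factor = MAX_FACTOR if actual_ratio == 0 else target_ratio / actual_ratio
        factor = min(max(factor, MIN_FACTOR), MAX_FACTOR)
        adjusted[domain] = base * factor
    return adjusted


def sample_domain(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one domain with probability proportional to its weight."""
    domains = list(weights)
    return rng.choices(domains, weights=[weights[d] for d in domains], k=1)[0]
```

With equal base weights and completions skewed 80/20 towards math, the under-represented coding domain ends up with the larger adjusted weight, so it is sampled more often until the output mix recovers.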

Metrics

Per-domain stats: domain_mix_actual/{domain}, domain_mix_target/{domain}, domain_mix_count/{domain}.
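
A minimal sketch of how these three metric families could be derived from per-domain completion counts. The metric names come from the description above; the function `domain_mix_metrics` and the interpretation of "actual" as the share of completed rollouts are assumptions.

```python
from typing import Dict


def domain_mix_metrics(target_mix: Dict[str, float],
                       counts: Dict[str, int]) -> Dict[str, float]:
    """Build per-domain count, target-share, and actual-share metrics."""
    total_target = sum(target_mix.values())
    if total_target <= 0:
        raise ValueError("target_mix weights must sum to a positive value")
    total_count = sum(counts.get(d, 0) for d in target_mix)
    metrics: Dict[str, float] = {}
    for domain, weight in target_mix.items():
        count = counts.get(domain, 0)
        metrics[f"domain_mix_count/{domain}"] = count
        metrics[f"domain_mix_target/{domain}"] = weight / total_target
        # Assumption: "actual" is the domain's share of completed rollouts.
        metrics[f"domain_mix_actual/{domain}"] = (
            count / total_count if total_count else 0.0
        )
    return metrics
```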

@ehsk ehsk left a comment

LGTM overall except a couple of minor things!

ehsk commented Feb 4, 2026

Here's also a sanity check to compare pre-multi-domain (displayed as ref below, orange in the top row and light blue at the bottom) vs. multi-domain on only MATH:

[Figures not preserved: Reward and Entropy plots, one row each for GRPO and GSPO]

@rafapi rafapi merged commit 39830c4 into main Feb 4, 2026

3 participants